Project 4 explores approaches for implementing a movie recommendation model. Three model approaches were studied that recommend movies based on genre, reviewer similarity, and movie similarity. System I models make predictions based on genres, with the code and results included below. System II models make predictions based on user similarity and movie-rating similarities. The model implementations matched the recommenderlab results within the goals of the assignment.
The Project 4 data set contains about 1 million anonymous ratings of approximately 3,900 movies made by 6,040 users who joined MovieLens in 2000. The data set consists of two files with the following structures.
Ratings Dataset
str(ratings)
## 'data.frame': 1000209 obs. of 4 variables:
## $ UserID : int 1 1 1 1 1 1 1 1 1 1 ...
## $ MovieID : int 1193 661 914 3408 2355 1197 1287 2804 594 919 ...
## $ Rating : int 5 3 3 4 5 3 5 5 4 4 ...
## $ Timestamp: int 978300760 978302109 978301968 978300275 978824291 978302268 978302039 978300719 978302268 978301368 ...
Movies Dataset
## 'data.frame': 3883 obs. of 4 variables:
## $ MovieID: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Title : chr "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
## $ Genres : chr "Animation|Children's|Comedy" "Adventure|Children's|Fantasy" "Comedy|Romance" "Comedy|Drama" ...
## $ Year : num 1995 1995 1995 1995 1995 ...
The plots above show some characteristics of the data set.
Plot 1 (Reviews per Reviewer) shows the distribution of the 1,000,209 reviews among the 6,040 users. For a skeptic, users with over 500 reviews might arouse suspicion; for example, these users could be bots, professional reviewers, or employees of movie companies. We removed these users for Scheme B, which relies on reviewer opinions.
Plot 2 (Mean Rating by Reviewer) shows that users have different scales for rating movies. Centering the ratings for each user makes comparisons more relevant. Separately, ratings do not seem to fluctuate year-to-year, so no normalization is implemented.
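The effect of centering can be sketched with a toy example. The report implements this step in R with dplyr; the following is an illustrative Python/NumPy sketch with invented user names and ratings, not part of the report's pipeline:

```python
import numpy as np

# Toy ratings from two users with different personal scales
# (user names and values are illustrative, not from the data set).
ratings = {
    "harsh_user":    np.array([1, 2, 2, 3]),   # rates everything low
    "generous_user": np.array([4, 5, 5, 3]),   # rates everything high
}

# Center each user's ratings by subtracting that user's own mean,
# mirroring the group_by(UserID) / mutate(Rating - mean(Rating)) step.
centered = {u: r - r.mean() for u, r in ratings.items()}
```

After centering, every user's mean rating is 0, so a +1 from a harsh user and a +1 from a generous user carry the same weight when averaged across users.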
Plot 3 (Reviews Count by Movie) shows the distribution for how many reviews each movie receives. The plot highlights that 290 movies have five or fewer reviews. These movies could have skewed ratings based on a few reviews. Weighting or omission would reduce skew.
Plot 4 (Review Count by Genre) shows the number of reviews by genre. The genre-based models should consider how the number of reviews might affect predictions.
Two methods for making recommendations using only the preferred genre of the requester were designed and tested. Scheme A uses overall popularity, while Scheme B ordains “genre experts” to pick movies.
System I uses the data set described in the previous section, which consists of two data frames: ratings and movies. The ratings data refers to films and users only by ID number. The movies data frame provides the Title and Genres of each movie.
The top picks for each genre are displayed on a scrollable grid. Click the button on the right to view the code.
Scheme A uses popular opinion to choose movies based on genre. This model employs a simple approach that averages review ratings for each film and recommends the films with the highest average ratings for each genre. This method performs the following training steps:
Training
Recommending
set.seed(3814)
popular_movies = ratings %>%
  # Center user ratings
  group_by(UserID) %>%
  mutate(Rating = Rating - mean(Rating)) %>%
  ungroup() %>%
  # Summarize reviews by Movie (average of ratings)
  group_by(MovieID) %>%
  summarize(num_reviews = n(),
            Rating = mean(Rating)) %>%
  filter(num_reviews > 100, Rating > 0) %>%
  ungroup() %>%
  # Add movie details for Title and Genres
  left_join(movies, by = "MovieID") %>%
  # Duplicate film entries for each of their genres
  separate_rows(Genres, sep = '[|]') %>%
  arrange(Genres, Rating)

# Create separate Top-10 lists for each genre
popular_recs = popular_movies %>%
  group_by(Genres, Title) %>%
  summarize(.groups = 'drop_last', MovieID = first(MovieID),
            Rating = mean(Rating)) %>%
  group_by(Genres) %>%
  slice_max(order_by = Rating, n = 10, with_ties = FALSE) %>%
  ungroup() %>%
  # Add URLs for movie artwork
  mutate(Image = paste0(small_image_url, MovieID, '.jpg?raw=true'))
The recommendations are shown below and include mostly blockbuster movies that most users would recognize.
# Display Results
movieGrid(popular_recs, "Genre Recommendations Based on Popularity")
[Image grid: Top-10 Scheme A (Popularity) recommendations for each genre (Action, Adventure, Animation, Children’s, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western), displayed as movie-poster artwork.]
Scheme B diverges from general popularity and analyzes how “genre experts” rate the films. In this case, a genre expert is a person who has written many reviews for a particular genre. The expectation is that this approach can recommend some excellent but more ‘sophisticated’ movies. This scheme performs the following steps.
Training
set.seed(3814)
# Create lists of movie ratings by genre
ds_ratings = ratings %>%
  # Center user ratings
  group_by(UserID) %>%
  mutate(Rating = Rating - mean(Rating)) %>%
  ungroup() %>%
  # Add movie info (title, genres)
  left_join(movies, by = "MovieID") %>%
  mutate(rev_yr = year(as.Date(as.POSIXct(Timestamp, origin = "1970-01-01")))) %>%
  # Create duplicate reviews for each applicable genre
  separate_rows(Genres, sep = '[|]')

# Find the "experts" for each genre - the users with the most reviews in it
genre_experts = ds_ratings %>%
  group_by(Genres, UserID) %>%
  summarize(.groups = 'drop_last', Num_reviews = n()) %>%
  filter(Num_reviews < 500) %>%
  slice_max(order_by = Num_reviews, n = 50)
kable(genre_experts %>%
        group_by(Genres) %>%
        summarize(.groups = 'drop',
                  Experts = n(),
                  Ave = round(mean(Num_reviews), 0),
                  Max = max(Num_reviews),
                  Min = min(Num_reviews)),
      'html', table.attr = "style='width:50%;'",
      caption = "Genre Experts") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                font_size = 12,
                full_width = FALSE) %>%
  add_header_above(c(" " = 2, "Reviews Per Expert" = 3)) %>%
  scroll_box(width = "800px", height = "350px")
| Genres | Experts | Ave | Max | Min |
|---|---|---|---|---|
| Action | 51 | 288 | 368 | 257 |
| Adventure | 50 | 149 | 192 | 132 |
| Animation | 52 | 66 | 88 | 57 |
| Children’s | 50 | 127 | 187 | 105 |
| Comedy | 50 | 405 | 490 | 363 |
| Crime | 50 | 94 | 152 | 81 |
| Documentary | 60 | 22 | 41 | 16 |
| Drama | 50 | 400 | 493 | 345 |
| Fantasy | 57 | 42 | 54 | 38 |
| Film-Noir | 57 | 26 | 42 | 22 |
| Horror | 50 | 173 | 274 | 135 |
| Musical | 54 | 61 | 92 | 53 |
| Mystery | 53 | 55 | 99 | 43 |
| Romance | 51 | 181 | 391 | 148 |
| Sci-Fi | 52 | 172 | 225 | 151 |
| Thriller | 51 | 238 | 378 | 194 |
| War | 56 | 72 | 124 | 60 |
| Western | 51 | 38 | 57 | 30 |
Recommending
# Create a list of reviews for each genre written by the genre experts
expert_recs = genre_experts %>%
  left_join(ds_ratings, by = c("Genres", "UserID")) %>%
  # Calculate the mean Rating for each movie in a genre list
  group_by(Genres, MovieID) %>%
  summarize(.groups = 'drop_last',
            Title = first(Title),
            Rating = mean(Rating),
            Num_reviews = n()) %>%
  # Remove below-average movies
  filter(Rating > 0) %>%
  # Pick top N rated movies in each list
  slice_max(order_by = Rating, n = 10, with_ties = FALSE) %>%
  # Add URL for artwork for output
  mutate(Image = paste0(small_image_url,
                        MovieID,
                        '.jpg?raw=true'))
# Display the results on grid below
movieGrid(expert_recs, "Genre Recommendations Based on Expert Opinion")
[Image grid: Top-10 Scheme B (Expert Opinion) recommendations for each genre, Action through Western, displayed as movie-poster artwork.]
The two schemes produce different Top-5 lists, though some genres overlap more than others; the drama, documentary, and thriller genres differ the most. The Top-10 lists show more overlap. Both models returned blockbuster movies and no small independent films. This was expected for the Popularity model. The Expert Opinion model would benefit from more specific classifications of experts, such as sex and age, to provide more targeted recommendations; however, user demographics were not part of the data set.
In this section, collaborative models use similarity measures to determine recommendations.
For System II, the training data set consists of an [n x m] (500 users x 3,706 movies) realRatingMatrix, a sparse matrix that contains each user’s rating for each movie. An NA value is assigned if a user has not reviewed the movie. The test set consists of a single user’s movie ratings (a [1 x 3706] realRatingMatrix). The matrices are created using the script provided with the assignment.
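The layout of such a rating matrix can be sketched with a toy example. This is an illustrative Python/NumPy sketch (3 invented users and 4 invented movies, with NaN playing the role of NA), not the realRatingMatrix class itself:

```python
import numpy as np

# Toy rating matrix: rows are users, columns are movies, and NaN marks
# "user has not reviewed this movie" (the actual training matrix in the
# report is 500 users x 3706 movies).
R = np.array([
    [5.0, np.nan, 3.0, np.nan],
    [np.nan, 4.0, np.nan, 1.0],
    [2.0, np.nan, np.nan, 5.0],
])

# Fraction of user/movie pairs with no rating -- the reason a sparse
# representation is used for the full matrix.
sparsity = np.isnan(R).mean()
```

Here half the entries are missing; in the full MovieLens matrix the missing fraction is far higher, which is why a sparse representation matters.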
The UBCF model creates recommendations based on ratings by reviewers with similar preferences as the requester receiving the recommendations. In application, the requester provides ratings for a handful of movies, which are then matched with reviewers with similar preferences. The model predicts ratings using the k nearest neighbors based on the cosine distance between reviewers and the requester. Finally, the requester receives a list of new movies based on the preferences of similar reviewers.
Training/Prediction
The training and prediction process are performed together and follow the outline from the assignment document.
Calculate a vector [1 x movies] containing the average of the nearest neighbors’ movie ratings, weighted by each neighbor’s similarity to the requester, using the formula below.
\[\hat{r}_{al} = \frac{1}{\sum_{i\in N(a)} s_{ai}}\sum_{i \in N(a)}s_{ai}r_{il}\]
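A toy numerical check of this weighted average may help. The values below are invented, and the sketch is in Python/NumPy rather than the report’s R; as in the R code that follows, neighbors who did not rate movie l are excluded from both sums:

```python
import numpy as np

# Three nearest neighbors with similarities s_ai to the requester, and
# their centered ratings r_il of one movie l; NaN marks "did not rate l".
s = np.array([0.9, 0.7, 0.4])       # similarity to the requester
r = np.array([1.0, np.nan, -0.5])   # neighbors' centered ratings of movie l

rated = ~np.isnan(r)                # only neighbors who rated the movie count
r_hat = np.sum(s[rated] * r[rated]) / np.sum(s[rated])
# r_hat = (0.9*1.0 + 0.4*(-0.5)) / (0.9 + 0.4) = 0.7 / 1.3
```

This mirrors the `colSums(T * S) / colSums(pred_unnorm * S)` step in the R implementation below, computed here for a single movie.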
set.seed(3814)
nn = 20
d_train = normalize(train)
d_test = normalize(test)
train_mat = as(d_train, "matrix")
test_mat = as(d_test, "matrix")

# nn nearest neighbors = reviewers (train_mat) closest to requester (test_mat)
m_cosine = simil(train_mat, test_mat, method = "cosine")
sim = (m_cosine + 1) / 2
nn_idx = order(sim, decreasing = TRUE)[1:nn]  # indices of the nn nearest neighbors

# Create matrix of top nn neighbors from train
T = train_mat[nn_idx, ]  # matrix of nearest neighbors (rows)
S = sim[nn_idx]          # similarity of nearest neighbors (rows)
pred_unnorm = !is.na(T)  # mask of movies each neighbor actually rated

# Weighted average of neighbor ratings, normalized by the similarities
# of the neighbors who rated each movie
T = colSums(T * S, na.rm = TRUE) / colSums(pred_unnorm * S, na.rm = TRUE)
T = T * (is.na(test_mat))  # keep predictions only for movies the requester has not rated
mypred = ifelse(T == 0, NA, T + d_test@normalize$row$factors$means)
The UBCF code closely matched the RecommenderLab results and met the full requirements of the assignment.
recommender.UBCF <- Recommender(train, method = "UBCF",
                                parameter = list(normalize = 'center',
                                                 method = 'Cosine',
                                                 nn = 20))
p.UBCF <- predict(recommender.UBCF, test, type = "ratings")
p.UBCF <- as.numeric(as(p.UBCF, "matrix"))

NA_diff = sum(is.na(p.UBCF) != is.na(mypred))      ### should be zero
Sum_diff = max(abs(p.UBCF - mypred), na.rm = TRUE) ### should be less than 1e-06

kable(data.frame("UBCF Comparison Tests" = c('NA Response Difference',
                                             'Sum of Response Differences'),
                 "Results" = c(as.character(NA_diff),
                               sprintf("%1.4e", Sum_diff)),
                 "Goal" = c("0", "< 1e-06")),
      align = rep('c', 2),
      caption = "UBCF Comparison to RecommenderLab",
      'html', table.attr = "style='width:50%;'") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                font_size = 12,
                full_width = FALSE)
| UBCF.Comparison.Tests | Results | Goal |
|---|---|---|
| NA Response Difference | 0 | 0 |
| Sum of Response Differences | 8.4676e-07 | < 1e-06 |
The IBCF model uses a list of movies as input and returns predicted ratings of other movies based on similarity to the input list. This model differs from UBCF as it calculates the similarity between movies rather than reviewers. First, the model creates a [m x m] similarity matrix using the training data. Similarity is determined by the k nearest neighbors based on the cosine distance of reviews between movies. Then, the model returns new movies to the requester based on these similar movies. The source code is shown below.
nn = 30
train_mat = as(normalize(train), "matrix")

# Similarity matrix [m x m] between movies (columns of train_mat)
sim_mat = as(simil(t(train_mat), t(train_mat), method = "cosine"), "matrix")
sim_mat = (sim_mat + 1) / 2
diag(sim_mat) = NA

# Keep only the nn nearest neighbors of each movie
for (i in 1:nrow(sim_mat)) {
  sidx = tail(order(sim_mat[i, ], decreasing = FALSE, na.last = FALSE), nn)
  sim_mat[i, -sidx] <- NA
}

# Make into a sparse matrix to better track NAs
sim_mat = dropNA(sim_mat)

# Format active user movie preferences (test)
test_mat = as(test, "dgCMatrix")
test_mask = !is.na(as(test, "matrix"))  # mask for test_mat entries != NA

# Weighted average of similarities by active user input (test)
pred = (sim_mat %*% t(test_mat)) / (sim_mat %*% t(test_mat != 0))
mypred = as(t(pred), "matrix")

# Remove the movies already rated in the active user input (test)
mypred[test_mask] = NA
Training/Prediction
The training and prediction process are performed together and follow the outline from the assignment document.
Predict the requester’s rating of movies based on the weighted average of movie similarities to the input list of their movie ratings. The formula is shown below.
\[\hat{r}_{al} = \frac{1}{\sum_{i\in S(l)} s_{li}}\sum_{i \in S(l)}s_{li}r_{ai}\]
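A toy numerical check of this formula (invented values; Python/NumPy rather than the report’s R): for a target movie l, only the similar movies the requester has actually rated enter the sums.

```python
import numpy as np

# S(l): the movies similar to target movie l that the requester has
# rated, with similarities s_li and the requester's ratings r_ai.
s_li = np.array([0.8, 0.6])   # similarity of movie l to the rated movies i
r_ai = np.array([4.0, 2.0])   # requester's ratings of those movies

# Predicted rating: similarity-weighted average of the requester's
# own ratings of the movies most similar to l.
r_hat = np.sum(s_li * r_ai) / np.sum(s_li)
# r_hat = (0.8*4.0 + 0.6*2.0) / (0.8 + 0.6) = 4.4 / 1.4
```

The prediction leans toward the rating of the more similar movie (4.0), which is the intended behavior of the item-based weighting.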
The cosine similarity measure differentiates user preferences (UBCF) much better than movie similarities (IBCF). For IBCF, most movies had identical similarity to many (often hundreds of) other movies, so there was not enough differentiation to choose a definitive set of neighbors. Hence, different model implementations are likely to choose different movies depending on how they break ties when selecting the top 30 most similar films. To match the RecommenderLab results, the KNN selection had to match precisely.
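The tie problem can be seen in a toy example (Python/NumPy, invented values): when similarities are identical, the “top k” set depends entirely on tie-breaking order.

```python
import numpy as np

# Four movies tied at similarity 0.9 and one at 0.5; ask for the top 2.
sim = np.array([0.9, 0.9, 0.9, 0.9, 0.5])

# A stable sort takes the earliest-indexed ties first...
top2_first = np.argsort(-sim, kind="stable")[:2]

# ...but {2, 3} would be an equally valid top-2 set. Two implementations
# that break ties differently return different "nearest neighbors", and
# therefore different recommendations, from identical inputs.
```

This is why the custom IBCF implementation had to replicate RecommenderLab’s neighbor selection exactly to pass the comparison tests.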
recommender.IBCF <- Recommender(train, method = "IBCF",
                                parameter = list(normalize = 'center',
                                                 method = 'Cosine',
                                                 k = 30))
p.IBCF <- predict(recommender.IBCF, test, type = "ratings")
p.IBCF <- as.numeric(as(p.IBCF, "matrix"))

## first output: should be less than 10
NA_diff = sum(is.na(p.IBCF) != is.na(mypred))
## second output: should be less than 10%
mydiff = abs(p.IBCF - mypred)
Sum_diff = sum(mydiff[!is.na(mydiff)] > 1e-6) / sum(!is.na(mydiff))

kable(caption = "IBCF Comparison to RecommenderLab",
      data.frame("IBCF Comparison Test" = c('NA Response Difference',
                                            'Sum of Response Differences'),
                 "Results" = c(as.character(NA_diff),
                               sprintf("%1.4e", Sum_diff)),
                 "Goal" = c("< 10", "< 10%")),
      align = rep('c', 2),
      'html', table.attr = "style='width:50%;'") %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                font_size = 12,
                full_width = FALSE)
| IBCF.Comparison.Test | Results | Goal |
|---|---|---|
| NA Response Difference | 0 | < 10 |
| Sum of Response Differences | 0.0000e+00 | < 10% |